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Abstract 

Background: Alignment of protein sequences (MPSA) is the starting point for a multitude of applications in 
molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our 
implementation has a strict modular structure, which allows to swap different components of the alignment 
process and, thus, to investigate their contribution to the alignment quality and computation time. We 
systematically varied information sources, guiding trees, score transformations and iterative refinement options, and 
evaluated the resulting alignments on BAIiBASE and SABmark. 

Results: Our results indicate the optimal alignment strategy based on the choices compared. First, we show that 
pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. 
Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive 
alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency 
transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing 
step, which was shown to be comparable with the best consistency transformation in both time and accuracy. 
Finally, propagating information beyond four transitive links introduces more noise than signal. 

Conclusions: This is the first time multiple protein alignment strategies are comprehensively and clearly compared 
using a single implementation platform. In particular, we showed which of the existing consistency transformations 
and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna. 
biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1). 



Background 

A variety of methods used by modern molecular biology 
such as structural modeling, function annotation, phyloge- 
netic analysis and similarity searches are based on multiple 
protein sequence alignments (MPSA). Correct MPSA 
arranges in one column position amino acids that share a 
common ancestor or are functionally/ structurally equiva- 
lent. MPSA provides position-specific information on 
evolutionary conserved characters, correlation between 
characters and their distribution. These features can be 
used in many further applications for which the quality of 
MPSA is, therefore, crucial [1]. Traditionally, the quality of 
the alignment is evaluated using a scoring function based 
on gap penalties and a substitution matrix. When two 
sequences are aligned, an exact solution can be found 
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using dynamic programming [2,3]. When many sequences 
are aligned, the exact solution is computationally expen- 
sive and becomes intractable for more than a few 
sequences [4]. The majority of methods avoid exponential 
scaling of alignment problem by employing various greedy 
heuristics, including a widely used progressive alignment 
technique [5]. Classic progressive methods such as 
ClustalW [6] are fast and can produce reasonable results 
for sequences that are sufficiently similar (e.g., show iden- 
tity higher then 40%) but become impractical in the so 
called twilight zone (e.g., identity below 30%) [1,7]. A vast 
amount of research has been conducted in order to 
improve alignment quality for distant sequences. Many 
methods that have succeeded in this task are based on 
modifying the scoring function. We use a unified graph 
framework to compare different scoring functions. 

In this article we shall use the term "alignment graph" as 
it is defined in the graph-based alignment algorithm by 
Rausch et al. [8] and the corresponding implementation in 
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the SeqAn sequence alignment library [9]. The alignment 
graph represents residues of aligned sequences by vertices 
and various signals on which the alignment is based, such 
as residue substitutions from the substitution model, 
structural links etc., by edges connecting the vertices. This 
provides a flexible model for scoring: any type of informa- 
tion can be introduced by adding edges or weights to the 
existing edges, while certain transformations of the graph 
help to avoid local minima during the optimization step. 

The separation of the input signal for the alignment 
from the process that produces the alignment was pio- 
neered by T-Coffee [10]. This program starts by compiling 
a library of local and global pairwise alignments, which 
equals to populating the alignment graph with the corre- 
sponding edges. T-Coffee was then extended to include 
structural information [11] and to combine alignments 
produced by other methods into a consensus alignment 
[12]. Another example is MaxFlow [13], which initializes 
the alignment graph with information from PSI-BLAST 
searches. We note that, in spite of these implementations, 
the contribution of various information sources to the 
quality of MPSA is yet to be analyzed. Here, we supply 
this need by presenting a clear comparison of sequence 
based information sources. 

Various consistency transformations have been shown 
to effectively improve alignments of distant sequences 
[10,14,15]. The idea is to transform scores used to align 
pairs of sequences to be consistent between different pairs. 
For example, for three sequences A, B and C, if residue A[ 
is aligned to residue Bj and residue Bj is aligned to residue 
Ck, then this implies that a consistent scoring between A 
and C should lead to the alignment of Aj to Ck- In align- 
ment graph notation consistency implies shifting weights 
between neighboring edges. Pioneering work on consis- 
tency was presented by T-Coffee that implemented consis- 
tency transformation referred as triplet library extension. 
In this transformation the consistency of edge between 
vertices Aj and Bj is increased by iterating all possible Ck 
vertices and adding the minimum of (Aj, QJ and (Bj, Ck) 
weights to the (A ; , Bj) weight. This idea has been extended 
in the SeqAn::T-Coffee program, which implements T- 
Coffee algorithm using SeqAn sequence alignment library 
[8]. As in the original T-Coffee, existing edges are modi- 
fied by iteration of neighboring triplets. However, this 
version also introduces new edges where triplet informa- 
tion requires such for consistency. Tests on benchmark 
suggest that this approach improves alignment quality [8]. 

Consistency transformation was also approached in a 
probabilistic framework. For example, in the ProbCons 
[14] implementation, vertices of the alignment graph are 
connected by probabilities according to pairwise hidden 
Markov models (HMMs). These probabilities are then 
transformed towards a higher degree of consistency by 
additively combining probabilities of triplet paths trough 



all possible vertex triplets. This transformation is similar 
to triplet library extension, but since edges are weighted 
by a probability function, the contribution of (A ; , Ck) and 
(Bj, Ck) edge-pair to the (A i( Bj) edge is equal to the pro- 
duct of the corresponding probabilities. ProbCons allows 
repeating the consistency transformation several times, in 
effect extending consistency to sets of four, five or a lar- 
ger number of vertices. 

Consistency and transitivity concepts were unified in 
MaxFlow program, which was designed to align motifs of 
remote homologs that have little sequence similarity when 
compared directly, but can be connected via a path of 
pairwise alignments [13]. MaxFlow starts with an align- 
ment graph based on a library of PSI-BLAST alignments. 
This graph is weighted by assigning each vertex pair a 
consistency score: the number of common neighbors rela- 
tive to the total number of neighbors for the two vertices 
in a pair. The graph is then used to align a pair of distant 
structural homologs. During the alignment residue pairs 
are weighted using the weakest path link: the maximum 
over the path scores defined by the weakest consistency 
score in a path. In comparison with classical methods, 
MaxFlow demonstrated superior performance in both 
reliability and coverage of structural motifs aligned. 

In addition to the scoring function, alignment quality 
is largely affected by guiding tree used in the progressive 
alignment. Our results suggest that single linkage clus- 
tering is optimal for this purpose regardless of the 
benchmark set. 

The progressive alignment method has a serious pitfall: 
subgroups of sequences are aligned independently of one 
another, which implies that an optimal subalignment 
produced near the top of the tree can become a source of 
errors as it is aligned to other subalignments in the later 
stages [16]. Consistency scoring, discussed earlier, is one 
way to address this problem; the other is to correct align- 
ment errors in a post processing step called iterative 
refinement. In one step of iterative refinement MPSA is 
partitioned horizontally into two subalignments, which 
are then realigned and the new alignment is kept if it 
improves scoring function. Iterative methods based on 
simulated annealing are too slow to be practical [16]. 
Other methods differ mainly in the way sequences are 
divided into two groups before being re-aligned [17]. 
This can be a leave one out [18], random partitioning 
[19], or tree-dependent restricted partitioning [16]. The 
last option was shown to be effective in terms of both 
quality and speed, and a variation of this technique has 
been adopted by several current methods including 
MUSCLE [20] and MAFFT [21]. 

Here, we present a novel MPSA program based on the 
SeqAn sequence alignment library. Our implementation 
has a strict modular structure, which allows to swap dif- 
ferent components of the alignment process and, thus, to 
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investigate their contribution to the alignment quality. 
We used our program to see how alignment quality 
changes depending on the input information, guiding 
trees, the applied consistency transformations, and the 
strategy for iterative refinement during post-processing. 
This is the first time these strategies are comprehensively 
and clearly compared using a single implementation plat- 
form. Our results confirm the existing knowledge on 
which strategies are efficient. We also show that the best 
strategy is comparable in accuracy to the best software in 
the field. 

Results 

To compare alignment strategies we systematically var- 
ied information sources, guiding trees, consistency, cli- 
que transformation and iterative refinement options, 
evaluating the resulting alignments on BAliBASE and 
SABmark. 

Comparison of information sources 

We found that using both global and local pairwise 
alignments to construct the alignment graph, resulted in 
high quality multiple alignments for both BAliBASE 
(Figure 1) and SABmark (Figure 2A and 2B) benchmark 
databases. Adding longest common subsequences to glo- 
bal and local alignments had minor effect on the quality 
of the multiple alignments. Adding external information 
in the form of GTG motifs, extracted using motif track- 
ing as described in the original GTG article [22], 
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Figure 1 Performance of strategies with different information 
sources on BAliBASE. Boxplots show the sum-of-pairs (SP) score 
achieved by different strategies for alignments in the BAliBASE 
benchmark database. Boxplots display first, second and third 
quartiles as vertical lines; outliers are shown as pluses. The strategies 
tested differ in the combination of pairwise sequence information 
that is used to construct the alignment graph. Combinations 
include: C, longest common subsequence, L, the four top scoring 
local alignments, G, global alignment, M, GTG motifs. 
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Figure 2 Performance of strategies with different information 
sources on SABmark. Boxplots show developer (FD) score 
(equivalent to sum-of-pairs [SP] score) achieved by different 
alignment strategies for (A) 'Twilight Zone" and (B) "Superfamily" 
sets in the SABmark benchmark database. Boxplots display first, 
second and third quartiles as vertical lines; outliers are shown as 
pluses. The strategies tested differ in the combination of pairwise 
sequence information that is used to construct the alignment 
graph. Combinations include: C, longest common subsequence, L, 
the four top scoring local alignments, G, global alignment, M, GTG 
motifs. 



increased the quality of the multiple alignments by an 
average of 1% for both BAliBASE and SABmark (full 
data available in Additional file 2: Tables S2 and S3). 

Guiding trees 

We found that single linkage clustering is clearly the 
best option for all reference sets in BAliBASE and SAB- 
mark benchmarks (see Tables 1 and 2). To our surprise, 
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Table 1 Comparing guiding tree performance on BAMBASE 



RV11 (38) RV12(44) RV20 (41) RV30 (30) RV40 (49) RV50 (16) Overall (218) 



Guiding tree 


SP 


TC 


SP 


TC 


SP 


TC 


SP 


TC 


SP 


TC 


SP 


TC 


SP 


TC 


Time(sec) 


NJ 


0.554 


0.275 


0.877 


0.664 


0.881 


0.280 


0.798 


0.366 


0.864 


0.413 


0.855 


0.450 


0.805 


0.408 


23.71 


MIN 


0.587 


0.286 


0.908 


0.772 


0.902 


0.357 


0.832 


0.457 


0.884 


0.483 


0.871 


0.540 


0.831 


0.483 


24.00 


MAX 


0.575 


0.284 


0.901 


0.744 


0.888 


0.313 


0.822 


0.393 


0.850 


0.385 


0.839 


0.389 


0.813 


0.418 


23.30 


UPGMA 


0.585 


0.302 


0.905 


0.761 


0.891 


0.329 


0.828 


0.431 


0.863 


0.422 


0.859 


0.469 


0.822 


0.452 


23.32 


WUPGMA 


0.583 


0.294 


0.905 


0.761 


0.896 


0.329 


0.829 


0.432 


0.860 


0.423 


0.850 


0.453 


0.821 


0.448 


23.52 



Columns show the average sum-of-pairs (SP) and true-column (TC) scores achieved by different alignment strategies for each of the six BAIiBASE references. The 
number of alignments in each reference is given in parentheses. Overall values for the entire database are reported in addition to the mean execution time of 
each strategy. The best results in each column are shown in bold. The strategies differ in the algorithm used to build the guiding tree: neighbor joining (NJ), 
single linkage (MIN), complete linkage (MAX), UPGMA and weighted UPGMA (WUPGMA) 



it produced significantly better alignments than com- 
monly used neighbor joining. For BAIiBASE, single link- 
age increased accuracy relative to neighbor joining by 
3% and for SABmark benchmark by 5%. We note that 
this comparison is based on alignments produced with 
no consistency transformation or postprocessing to 
account for errors introduced by heuristic nature of pro- 
gressive alignment. It is probable that when consistency 
or postprocessing is included the particular type of the 
guiding tree used becomes less relevant. 

Consistency and clique transformations 

Our results show that triplet library extension improves 
alignment accuracy for all sets in both benchmarks when 
compared to alignments produced with no consistency 
transformation (referred here as the basic strategy). 
Triplet T . Co ff ee , which is limited to applying consistency to 
the existing edges, improved BAIiBASE alignments by 
1.7%, SABmark Twilight Zone alignments by 3.3% and 
SABmark Superfamily alignments by 1% (see Tables 3 
and 4). Triplet SeqA n> which introduces new edges when 
implied by consistency, was considerably better, enhan- 
cing BAIiBASE accuracy by 3%, SABmark Superfamily 
accuracy by 3.1% and SABmark Twilight alignment 

Table 2 Comparing guiding tree performance on 
SABmark 



Twilight zone (209) Superfamily (425) 



Guiding tree 


FD 


FM 


Time(sec) 


FD 


FM 


Time(sec) 


NJ 


22.63 


16.69 


0.56 


51.26 


39.71 


0.50 


MIN 


23.75 


17.29 


0.53 


53.88 


41.27 


0.49 


MAX 


23.02 


16.83 


0.54 


52.42 


40.31 


0.50 


UPGMA 


23.35 


16.97 


0.54 


53.29 


40.91 


0.50 


WUPGMA 


23.42 


17.01 


0.53 


53.31 


40.90 


0.50 



Columns show the average developer (FD) score (equivalent to sum-of-pairs 
[SP] score), modeller (FM) score and run time achieved by different alignment 
strategies for the "Superfamily" and 'Twilight Zone" sets in the SABmark 
database. The number of alignments in each reference is given in 
parentheses. The best results in each column are shown in bold. The 
alignment strategies differ in the algorithm used to build the guiding tree: 
neighbor joining (NJ), single linkage (MIN), complete linkage (MAX), UPGMA 
and weighted UPGMA (WUPGMA) 



accuracy by a remarkable 11.2%. Reiterating triplet SeqAn 
transformation for the second time either enhanced or 
deteriorated alignment accuracy depending on the parti- 
cular set. The effect was small and has little practical 
value. 

Applying clique transformation decreased alignment 
accuracy when compared to the basic strategy. Applying 
clique transformation after the graph has been made con- 
sistent by triplet T _ Co ff ee generally provided an additional 
improvement relative to triplet T _ Coffee . However, this is not 
practical since this transformation multiplied the required 
computational time by a factor of 8 for SABmark sets and 
by a factor of 65 for the BAIiBASE sets. This is due to the 
huge amount of edges generated by the clique transforma- 
tion which makes the progressive alignment step compu- 
tationally demanding. The task appeared even intractable 
when applying clique transformation after triplet Seq An con- 
sistency for some of the BAIiBASE alignments, for which 
reason we do not report on these results here. Running 
clique transformation after triplet SeqAn consistency for 
SABmark alignments decreased alignment accuracy. This 
suggests that triplet SeqAn is sufficient to introduce edges 
that have practical utility for the MPSA. Combining Max- 
Flow consistency with clique transformation had minor 
improvement on BAIiBASE alignments and deteriorated 
quality of SABmark alignments. 

Iterative refinement 

All iterative refinement strategies improved alignment 
accuracy, but the tree dependent strategies were more 
efficient. It made little difference, whether the partition- 
ing was done randomly (tree Ran d om ) or systematically in 
breath-first order (tree B F)- Random partitioning 
improved accuracy of BAIiBASE alignments by 1%, 
tree Ranc [ om by 2.6% and tree BF by 2.8%, relative to align- 
ments produced with no refinement (see Table 5). 
Improvement of accuracy for SABmark Twilight align- 
ments were 5.9%, 9% and 8.5%, respectively; and for 
SABmark Superfamily alignments 1.6%, 3.2% and 2.9%, 
respectively (Table 6). Interestingly, the tree dependent 
strategies are comparable in time and accuracy to the 
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Table 3 Comparing score transformation strategies on BAIiBASE 





RV11 (38) 


RV12 (44) 


RV20 (41) 


RV30 (30) 


RV40 (49) 


RV50 (16) 


Overall (218) 




Strategy 


SP 


% 
basic 


SP 


% 
basic 


SP 


% 
basic 


SP 


% 
basic 


SP 


% 
basic 


SP 


% 
basic 


SP 


% 
basic 


Time 

(sec) 


basic 


0.587 


100.0 


0.908 


100.0 


0.902 


100.0 


0.832 


100.0 


0.884 


100.0 


0.871 


100.0 


0.831 


100.0 


24.44 


maxflow 


0.585 


99.6 


0.910 


100.2 


0.909 


100.8 


0.836 


100.4 


0.895 


101.2 


0.880 


100.9 


0.836 


100.6 


25.04 


tripleWfe, 


0.613 


104.4 


0.914 


100.6 


0.914 


101.3 


0.840 


100.9 


0.901 


101.9 


0.888 


101.9 


0.845 


101.7 


27.85 


triplet 5M „ 


0.642 


109.3 


0.929 


102.3 


0.918 


101.8 


0.839 


100.8 


0.911 


103.1 


0.895 


102.7 


0.856 


103.0 


44.83 


triplets^ x 2 


0.648 


110.3 


0.928 


102.2 


0.917 


101.7 


0.842 


101.1 


0.910 


103.0 


0.896 


102.8 


0.857 


103.1 


608.79 


clique 


0.530 


90.2 


0.896 


98.7 


0.820 


90.9 


0.614 


73.7 


0.758 


85.8 


0.675 


77.4 


0.715 


86.1 


1 764.83 


tripletj-.ee/fe + 
clique 


0.637 


108.4 


0.930 


102.4 


0.919 


101.8 


0.839 


100.8 


0.905 


102.4 


0.885 


101.6 


0.852 


102.6 


1 753.63 


maxflow + clique 


0.590 


100.4 


0.924 


101.8 


0.904 


100.2 


0.819 


98.3 


0.876 


99.1 


0.879 


100.8 


0.832 


100.1 


2337.67 



Columns show the average sum-of-pairs (SP) score and the percentage value of the SP score relative to the basic strategy (%basic) achieved by each alignment 
strategy for each of the six BAIiBASE references. The number of alignments in each reference is given in parentheses. Overall values for the entire database are 
reported in addition to the mean execution time of each strategy. The best results in each column are shown in bold 



triplet SeqAn) which was the best consistency transforma- 
tion as presented above. The mean SP score for BAIi- 
BASE alignments produced with triplets eq An was 0.856, 
the corresponding score for tree B p was very similar: 
0.854. Computation times for these two strategies were 
also comparable: triplet SeqAn took on average 45 seconds 
and tree BF on average 54 seconds per BAIiBASE align- 
ment. Moreover, applying tree B F iterative refinement on 
BAIiBASE alignments produced under triplet Seq An con- 
sistency had no effect on accuracy when measured up to 
three decimal points of the SP score (Table 5). The 
same tendency can be seen from SABmark alignments: 
differences between triplet SeqA n and tree BF accuracy are 
rather cosmetic and tree B p is on average slightly faster 
(1.2 sec per alignments against two seconds per align- 
ment for Twilight alignments and one sec against 1.6 
sec for Superfamily alignments). For this benchmark, 
however, applying tree BF refinement to alignments pro- 
duced under triplet Seq An did result in some improve- 
ment (Table 6). 



The relative speed of iterative refinement strategies 
depended largely on the number of sequences in the align- 
ment. For BAIiBASE alignments, which on average contain 
30 sequences, run times of random and restricted parti- 
tioning were comparable. For SABmark alignments, which 
on average contain 8 sequences, restricted partitioning 
was slightly faster. Partitioning alignment along all possible 
edges in the guiding tree was generally faster than random 
tree dependent partitioning for alignments with a small to 
moderate number of sequences (BAIiBASE sets RV11, 
RV12, RV40 and RV50 and both SABmark sets), but 
became slower as the number of aligned sequences 
approached the number of refinement iterations (BAIi- 
BASE sets BB20 and BB30 that have on average 46 and 63 
sequences per alignment, respectively). 

Comparison to other programs 

We compared our optimal strategies to the eight leading 
multiple alignment programs: (1) ProbCons 1.12 [14], (2) 
T-Coffee 8.98 [10], (3) MUSCLE 3.8 [20], (4) MAFFT 



Table 4 Comparing score transformation strategies on SABmark 



Strategy 




Twilight zone (209) 






Superfamily (425) 




FD 


% basic 


Time (sec) 


FD 


% basic 


Time (sec) 


basic 


24.05 


100.0 


1.10 


53.74 


100.0 


0.95 


tripletT-coflfee 


24.83 


103.3 


1.10 


54.30 


101.0 


0.90 


triplet 5M „ 


26.74 


111.2 


1.97 


55.40 


103.1 


1.60 


tripletj^n x 2 


25.87 


107.6 


17.38 


55.45 


103.2 


14.39 


triplet SM „ x 3 


20.78 


86.4 


77.65 


50.77 


94.5 


68.59 


clique 


20.05 


83.4 


7.63 


48.66 


90.5 


6.35 


tripletr-cozfe + clique 


25.28 


105.1 


9.54 


54.65 


101.7 


7.97 


triplet Set?/ ,„ + clique 


25.36 


105.4 


11.55 


54.46 


101.3 


7.90 


maxflow + clique 


21.82 


90.7 


10.34 


52.02 


96.8 


8.70 



Columns show the average developer (FD) score, the percentage value of the FD score relative to the basic strategy (%basic) and the average run time achieved 
by different alignment strategies for the "Superfamily" and "Twilight Zone" sets in the SABmark database. The number of sequences in each reference is given in 
parentheses. The best results in each column are shown in bold 
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Table 5 Iterative refinement versus consistency, compared on BAMBASE 



Strategy 


RV11 (38) 


RV12 (44) 


RV20 (41) 


RV30 (30) 


RV40 (49) 


RV50 (16) 


Overall (218) 


Time 
(sec) 




SP 


% 

basic 


SP 


% 

basic 


SP 


% 

basic 


SP 


% 

basic 


SP 


% 

basic 


SP 


% 

basic 


SP 


% 

basic 




basic 


0.587 


100.0 


0.908 


100.0 


0.902 


100.0 


0.832 


100.0 


0.884 


100.0 


0.871 


100.0 


0.831 


100.0 


24.4 


random x 100 


0.623 


106.1 


0.922 


101.5 


0.902 


100.0 


0.832 


100.0 


0.885 


100.1 


0.871 


100.0 


0.839 


101.0 


40.6 


tree/wom x 1 00 


0.625 


106.4 


0.929 


102.3 


0.918 


101.7 


0.846 


101.6 


0.909 


102.8 


0.890 


102.2 


0.853 


102.6 


39.4 


tree fifs x 1 00 


0.626 


106.5 


0.927 


102.1 


0.916 


101.5 


0.846 


101.6 


0.912 


103.2 


0.898 


103.1 


0.854 


102.8 


54.1 


triplets^ 


0.642 


109.3 


0.929 


102.3 


0.918 


101.8 


0.839 


100.8 


0.911 


103.1 


0.895 


102.7 


0.856 


103.0 


44.8 


triplets«, A ,+ tree BFS x 
100 


0.643 


109.5 


0.930 


102.4 


0.917 


101.7 


0.842 


101.1 


0.909 


102.8 


0.894 


102.6 


0.856 


103.0 


195.4 



Columns show the average sum-of-pairs (SP) score and the percentage value of the SP score relative to the basic strategy (%basic) achieved by each alignment 
strategy for each of the six BAIiBASE references. The number of alignments in each reference is given in parentheses. Overall values for the entire database are 
reported in addition to the mean execution time of each strategy. The best results in each column are shown in bold 



6.847 [21], (5) ClustalW 2.0.12 [6], (6) Clustal Omega 1.0.3 
[23], (7) Kalign 2.03 [24] and (8) MSAProbs 0.9.5 [25]. 
MAFFT was run with L-INS-i option; other programs 
were run with default parameters. All programs were run 
with a single core. The comparison was done on BAIi- 
BASE (Table 7), SABmark (Table 8) and PREFAB (Table 
9). The optimal strategy for BAIiBASE alignments, referred 
as MMSA in Table 7, was to use global and local pairwise 
alignments complemented with GTG motifs as input 
information; to apply consistency transformation triplet Se _ 
q AnJ and to align the sequences using a single linkage guid- 
ing tree. No postprocessing was performed. For SABmark 
we chose to test two strategies: the same strategy as for 
BAIiBASE, and another strategy, that was augmented with 
iterative refinement of type treeBF (referred as MMSA:: 
tree BF in Table 8). On BAIiBASE our aligner ranked fourth 
in terms of alignment accuracy. MSAProbs, MAFFT and 
ProbCons were more accurate, while Kalign, MUSCLE, 
MAFFT and both Clustals had better time performance. 
Also for SABmark our aligner ranked fourth in term of 
accuracy, this time it was MSAProbs, ProbCons and T- 
Coffee that performed better. The fastest were again 
Kalign, both Clustals, MUSCLE and MAFFT. 

For PREFAB we chose to use the same strategy as for 
BAIiBASE. On the main set this yielded moderate 



accuracy outperforming only the fast aligners: MUSCLE, 
Kalign and ClustalW. On the weighted reference set our 
code performed poorly, since it does not account for the 
overrepresented families (Table 9). 

We conclude that, although our modular implementa- 
tion of the best alignment strategies does not outper- 
form the best aligners in the field, the performance is at 
a comparable level. 

Discussion 

We have completed a comprehensive comparison of 
graph based strategies for aligning multiple sequences. 
Our results suggest clear guidelines for a number of 
choices made during construction of a multiple protein 
sequence alignment. First, we showed that pairwise global 
and local alignments contain sufficient information to 
construct a high quality multiple alignment. When reli- 
able external information, the GTG motifs in our case, is 
available, it will most likely improve the accuracy and 
thus should also be included. Second, single linkage clus- 
tering is almost invariably the best algorithm to build a 
guiding tree for progressive alignment. Third, triplet Seq An 
is the most efficient consistency transformation from 
those compared. It can have a large improvement on 
alignment quality, particularly for alignments in the 



Table 6 Iterative refinement versus consistency, compared on SABmark 



Strategy 




Twilight zone (209) 






Superfamily (425) 




FD 


% basic 


Time (sec) 


FD 


% basic 


Time (sec) 


basic 


24.05 


100.0 


1.10 


53.74 


100.0 


0.95 


random x 100 


25.47 


105.9 


1.64 


54.62 


101.6 


1.49 


tree Rondom x 1 00 


26.21 


109.0 


1.51 


55.43 


103.2 


1.30 


tree BFS x 1 00 


26.08 


108.5 


1.19 


55.29 


102.9 


0.98 


triplets^,, 


26.74 


111.2 


1.97 


55.40 


103.1 


1.60 


triplet Sei!A „ + tree BFS x 100 


27.39 


113.9 


7.35 


55.51 


103.3 


5.75 



Columns show the average developer (FD) score, the percentage value of the FD score relative to the basic strategy (%basic) and the average run time achieved 
by different alignment strategies for the "Superfamily" and 'Twilight Zone" sets in the SABmark database. The number of alignments in each reference is given in 
parentheses. The best results in each column are shown in bold 
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Table 7 Performance of aligners on the BAMBASE benchmark database 



SP 



Aligner 


RV11 (38) 


RV12 (44) 


RV20 (41) 


RV30 (30) 


RV40(49) 


RV50 (16) 


Overall (218) 


MMSA 


0.642 


0.929 


0.918 


0.839 


0.911 


0.895 


0.856 


ProbCons 


0.670 


0.941 


0.917 


0.845 


0.903 


0.894 


0.862 


T-Coffee 


0.656 


0.939 


0.914 


0.837 


0.893 


0.895 


0.856 


MUSCLE 


0.572 


0.915 


0.889 


0.814 


0.865 


0.835 


0.815 


MAFFT 


0.662 


0.935 


0.927 


0.868 


0.926 


0.903 


0.870 


ClustalW 


0.501 


0.865 


0.852 


0.725 


0.789 


0.742 


0.746 


ClustalO 


0.590 


0.906 


0.902 


0.862 


0.902 


0.862 


0.837 


Kalign 


0.605 


0.912 


0.900 


0.813 


0.884 


0.820 


0.822 


MSAProbs 


0.682 


0.946 


0.928 


0.865 


0.925 


0.908 


0.876 










Time(sec) 








Aligner 


RV11 (38) 


RV12 (44) 


RV20 (41) 


RV30 (30) 


RV40(49) 


RV50 (16) 


Overall (218) 


MMSA 


0.817 


1.772 


69.797 


1 1 9.424 


38.725 


38.453 


44.831 


ProbCons 


2.481 


7.052 


142.641 


297.473 


108.307 


1 24.235 


113.698 


T-Coffee 


15.058 


23.310 


259.227 


646.212 


202.948 


210.856 


226.269 


MUSCLE 


0.701 


0.824 


5.490 


7.538 


9.519 


7.002 


5.179 


MAFFT 


0.840 


1.580 


14.071 


25.811 


27.455 


1 8.434 


14.698 


ClustalW 


0.310 


0.923 


13.281 


25.231 


8.155 


8.982 


9.480 


ClustalO 


0.406 


0.747 


4.224 


4.824 


5.214 


4.790 


3.368 


Kalign 


0.054 


0.090 


0.435 


0.562 


0.559 


0.411 


0.352 


MSAProbs 


2.092 


6.436 


141.371 


295.506 


86.463 


102.198 


105.678 



The values in the upper table show the average sum-of-pairs (SP) score and the values in the lower table mean run time achieved by different aligners for each 
of the six BAIiBASE references. The number of alignments in each reference is given in parentheses. Overall values for the entire database are reported in the last 
column. The best results in each column are shown in bold 



twilight zone. Alternatively, one can apply tree dependent 
partitioning as a post processing step, which was shown 
to be comparable with triplet SeqAn consistency transfor- 
mation in both time and accuracy. 

As a more subtle result, we found that transitivity, 
which in principle can increase the sensitivity of the 
aligner, will in most cases introduce more noise than 
signal. When we iterated triplet SeqAn two times, in effect 
transferring edge weights within neighborhoods of four 
vertices, the quality of the alignments increased slightly. 
However, when iterations were repeated three times, 



extending the influenced neighborhood to five vertices, 
the quality invariably deteriorated. The same was the 
case for clique transformation, where any two vertices, 
connected by a path, were connected by an edge and 
thus were allowed to directly influence each other dur- 
ing the progressive alignment. 

Conclusions 

We showed that graph based modular implementation 
allows to compare the contribution of different algorithmic 
components to the alignment quality and computation 



Table 8 Performance of aligners on the SABmark benchmark database 



Aligner 




Twilight zone (209) 




Superfamily (425) 


FD 


FM 


Time(sec) 


FD 


FM 


Time(sec) 


MMSA 


26.74 


19.13 


1.97 


55.40 


42.10 


1.60 


MMSA::treeBF 


27.39 


19.45 


7.35 


55.51 


42.07 


5.75 


ProbCons 


28.23 


20.65 


1.36 


56.26 


43.26 


1.37 


T-Coffee 


27.75 


20.50 


2.34 


55.86 


43.03 


2.21 


MUSCLE 


23.32 


16.28 


0.34 


52.43 


39.60 


0.36 


MAFFT 


25.70 


18.84 


0.70 


55.01 


42.22 


0.55 


ClustalW 


21.85 


14.80 


0.21 


50.29 


37.76 


0.21 


ClustalO 


22.17 


16.49 


0.22 


52.63 


40.85 


0.17 


Kalign 


21.87 


15.24 


0.06 


49.70 


38.02 


0.05 


MSAProbs 


28.57 


20.86 


1.1 1 


57.13 


43.92 


1.16 



Columns show the average developer (FD) score, modeller (FM) score and run time achieved by different aligners for the 'Twilight Zone" and "Superfamily" sets 
in the SABmark database. The number of alignments in each reference is given in parentheses. The best results in each column are shown in bold 
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Table 9 Performance of aligners on the PREFAB benchmark 








Identity range (main set) 






Aligner 


0-0.2 (1104) 


0.2-0.4 (400) 


0.4-0.7 (109) 


0.7-1 (69) 


Overall (1682) 


Time(sec) 


MMSA 


0.58 


0.88 


0.98 


0.98 


0.69 


21.47 


ProbCons 


0.61 


0.90 


0.97 


0.98 


0.72 


46.68 


T-Coffee 


0.60 


0.89 


0.97 


0.98 


0.71 


110.66 


MUSCLE 


0.56 


0.88 


0.97 


0.98 


0.68 


2.00 


MAFFT 


0.62 


0.90 


0.98 


0.98 


0.72 


4.16 


ClustalW 


0.48 


0.84 


0.98 


0.99 


0.62 


4.83 


ClustalO 


0.584 


0.895 


0.976 


0.987 


0.700 


1.419 


Kalign 


0.525 


0.845 


0.977 


0.984 


0.649 


0.160 


MSAProbs 


0.635 


0.913 


0.974 


0.979 


0.737 


49.192 








Identity range (weighted set) 






Aligner 


0-0.2 (59) 


0.2-0.4 (30) 


0.4-0.7 (4) 


0.7-1 (7) 


Overall (100) 


Time(sec) 


MMSA 


0.43 


0.82 


0.96 


0.98 


0.60 


2.82 


ProbCons 


0.50 


0.87 


0.93 


0.99 


0.66 


14.18 


T-Coffee 


0.46 


0.75 


0.93 


0.99 


0.60 


33.74 


MUSCLE 


0.42 


0.84 


0.95 


0.99 


0.61 


1.32 


MAFFT 


0.46 


0.87 


0.96 


0.99 


0.64 


1.31 


ClustalW 


0.39 


0.88 


0.96 


0.99 


0.60 


1.60 


ClustalO 


0.488 


0.876 


0.964 


0.989 


0.658 


1.107 


Kalign 


0.435 


0.876 


0.963 


0.989 


0.627 


0.114 


MSAProbs 


0.484 


0.881 


0.964 


0.990 


0.658 


1 2.660 



Columns show the average Q (equivalent to SP) score achieved for subsets of PREFAB benchmark database, that correspond to different identity values of the 
structural pair. The number of alignments in each identity range is given in parenthesis. Overall average scores and run time are given in the last two columns. 
Values for the main and weighted sets are given separately. The best results in each column are shown in bold 



time. We demonstrated that shifting edge information 
within triplets of alignment graph vertices prior to the pro- 
gressive step and the tree-dependent iterative refinement 
after the progressive step are equally effective strategies 
that significantly improve accuracy, particularly for the case 
of distant homologs. As a negative result we report that 
extending the neighborhood of edge shifting to five or 
more vertices introduces more noise than signal. 

Methods 

Algorithm overview 

Our program starts by gathering input information from 
pairwise alignments into an alignment graph. The graph is 
then refined, transformed and finally fed to a progressive 
alignment routine that builds an alignment. The resulting 
alignment is iteratively refined in a post processing step. 
Altogether there are six steps, detailed below. Each step 
was implemented as a combination of SeqAn library 
methods and our own code. The contribution of SeqAn 
library is outlined in Additional file 2: Table SI. 
1. Computing information sources 

Global, local and longest common subsequence align- 
ments were computed using SeqAn library with Blosum62 
substitution matrix, gap opening penalty set to -13 and 
gap extension penalty set to -1. Global alignments were 
constructed using Gotoh's affine gap cost and local 



alignments using Smith- Waterman algorithms. From local 
alignments we selected the best four, possibly overlapping, 
matches; as our preliminary tests indicate that including a 
larger number of local matches has little effect on accuracy 
(data not published). We acknowledge that the selection of 
substitution matrix and gap penalties will have a large 
effect on the pairwise alignments and hence on the overall 
MPSA. The values applied here are based on the literature 
and other implementations. 

Additionally, we compiled a collection of conserved 
motifs based on the Global Trace Graph (GTG) [22]. 
These were extracted using motif tracking as described 
in the original GTG article. Alignments and motifs were 
converted into edges between aligned sequence seg- 
ments referred hereafter as segment matches. 
2. Initializing the alignment graph 

The vertices of the graph represent segments of the 
sequence in the set of proteins to be aligned. The edges of 
the graph represent possible alignments between these 
segments. After the edges are generated in the first step, 
they are assigned weights using Blosum62 substitution 
matrix. Blosum62 was offset to include only positive 
values. After this, the structure of the alignment graph is 
simplified by refinement. During the refinement segment 
matches are cut into smaller parts in such a way that none 
of the matches partly overlap (for details see [8]). 
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3. Score transformation 

Score transformations can both modify the edge weights 
of existing edges, and create new edges. An overview of 
score transformations is presented in Additional file 3: 
Figure SI. 

3A. Consistency transformation Alignment graph is 
made consistent using either triplet library extension 
introduced in T-Coffee, triplet T _ Cof f ee , a variation of this 
introduced in SeqAn, triplet SeqAn , or the consistency 
measure used in MaxFlow [13]. Triplet SeqAn can be 
repeated several times, in effect extending consistency to 
sets of four, five or larger number of vertices. 
3B. Clique transformation Connected components are 
identified using Kruskal's algorithm and converted to 
cliques. Kruskal's is used to cluster alignment graph ver- 
tices by iterating all edges in the order of decreasing 
edge weight. This produces a collection of spanning 
trees, which are used to connect each pair of vertices in 
a tree with a weight equal to the weakest link of the 
path connecting the two vertices. 

4. Tree construction 

When global alignments are included in the list of infor- 
mation sources (see above), they are used to calculate the 
distance matrix. Otherwise, the distance matrix is calcu- 
lated using k-mer counting. A guiding tree is constructed 
from the distance matrix using neighbor joining, single 
linkage, complete linkage, UPGMA or weighted UPGMA. 
When the guiding tree is constructed using neighbor join- 
ing, the root is placed between the last two taxa to be 
joined. 

5. Progressive alignment 

Sequences are progressively aligned in the order defined 
by the guiding tree. In each step, we apply heaviest com- 
mon subsequence algorithm to align two subalignments. 
This is a segment based dynamic programming that 
finds the maximum weight trace from a set of refined 
alignment segments. 

6. iterative refinement 

We implemented three different refinement strategies: 
random partitioning introduced by Berget & Munson 
[19], and the two tree-dependent restricted partitioning 
strategies referred as tree Random and tree BF - Tree Random 
is the same strategy as used by MUSCLE and MAFFT: 
sequences are partitioned by cutting a random edge of 
the guiding tree. Tree B F was designed to be a time effi- 
cient version of tree Ran d om : each edge of the tree is cut 
only once in the breath-first order. 

Testing methodology 

We run our program with options corresponding to 
each of the tested strategies on BAliBASE 3.0 [26] and 
SABmark 1.65 [27] alignment benchmark databases. 
The best strategy was also tested on PREFAB 4.0 [20] 



benchmark. Tests were performed on a 2.93 GHz Intel 
Xeon X7350 with 66 GB RAM. 

The BAliBASE 3.0 benchmark is a collection of 218 
reference protein alignments, compiled from FSSP struc- 
tural alignments [28], HOMSTRAD [29] databases and 
hand constructed alignments. The database is organized 
into five reference sets: in Reference 1 sequences are 
equidistant and have two levels of conservation; in Refer- 
ences 2 a highly divergent "orphan" is aligned with a 
family of close homologs; in Reference 3 subgroups with 
< 25% residue identity between groups are aligned; Refer- 
ence 4 contains sequences with large N/C-terminal 
extensions; and Reference 5 sequences with large inser- 
tions in the middle of the aligned blocks. Alignments 
were scored relative to the core blocks of benchmark 
multiple alignments using quality measures proposed by 
BAliBASE: the sum of pairs score (SP) and the true col- 
umn score (TC). SP is equal to the number of residue 
pairs correctly aligned in the evaluated alignment relative 
to the total number of residue pairs in the benchmark 
alignment. TC is the number of columns with correct 
alignment divided by the total number of columns in the 
benchmark alignment. 

The SABmark 1.65 benchmark contains pairwise struc- 
tural alignments from SOFI [30] and CE [31] databases, 
that are organized according to SCOP classification. 
There are two sets: the "Twilight Zone" set contains 1740 
domains grouped into 209 SCOP folds, and the "Superfa- 
milies" set that contains 3280 domains grouped into 425 
SCOP superfamilies. Sequences in each of the twilight 
subsets are restricted to have at most 25% and those in 
superfamilies subsets to have at most 50% pairwise iden- 
tity. The quality of SABmark alignments was evaluated 
using FD and FM scores based on reference structural 
alignments. The FD score measures the number of cor- 
rect residue pairs relative to the number of paired resi- 
dues in the reference alignment and FM score relates the 
number of correct pairs to the total number of paired 
residues in the test alignment. 

PREFAB 4.0 is a large database generated automati- 
cally by supplementing structural pairs from FSSP data- 
base with homologs found through PSI-BLAST queries. 
Each alignment set is filtered to have at most 80% iden- 
tity and limited to a set of 50 random homologs. There 
are 1682 alignments in the main set and 100 alignments 
in the weighted set. The weighted set consists of families 
with one highly over-represented sub-family. The accu- 
racy of the multiple alignment for a given PREFAB 
alignment set is evaluated relative to the consensus of 
FSSP and CE alignments of the corresponding structural 
pair. Alignments were evaluated using PREFAB Q score, 
which is the analog of BAliBASE SP and SABmark FD 
scores. 
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Additional material 



Additional file 1: Source code, Makefile, installation instructions and 
test alignments. 

Additional file 2: Table SI. Outline of MMSA implementation. This table 
presents the main alignment steps and the corresponding components 
and functions. Some functions are referred in the main text using 
abbreviations listed in the fourth column. Source code for any function 
can be found in one of the two packages: SegAn seguence alignment 
library or MMSA code deposited at http://ekhidna.biocenter.helsinki.fi/ 
MMSA. Table S2. Performance of strategies with different information 
sources on the BAIiBASE benchmark. Columns show the average sum-of- 
pairs (SP) and true-column (TC) scores achieved by different alignment 
strategies for each of the six BAIiBASE references. The number of 
sequences in each reference is given in parentheses. Mean values for the 
entire database are reported in addition to the mean execution time of 
each strategy. The best results in each column are shown in bold. The 
strategies tested differ in the combination of pairwise sequence 
information that is used to construct the alignment graph. Combinations 
include: C, longest common subseguence, L, the four top scoring local 
alignments, G, global alignment, M, GTG motifs. Table S3. Performance 
of strategies with different information sources on the SABmark 
benchmark Columns show the average developer (FD) score (equivalent 
to sum-of-pairs [SP] score) and modeller (FM) score achieved by different 
alignment strategies for the "Superfamily" and 'Twilight Zone" sets in the 
SABmark database. The number of seguences in each reference is given 
in parentheses. Mean execution time of each strategy is reported in 
seconds. The best results in each column are shown in bold. The 
strategies tested differ in the combination of pairwise sequence 
information that is used to construct the alignment graph. Combinations 
include: C, longest common subseguence, L, the four top scoring local 
alignments, G, global alignment, M, GTG motifs. 

Additional file 3: Figure SI . Overview of the score transformations. A) 
Alignment graph of four seguences after refinement has resolved 
conflicts between segment matches. B) Applying triplet T _ Co ffee reinforces 
connections within triplet digues. C) Applying triplet SeqAn reinforces 
connections within triplet digues and introduces new edges between 
vertices that have a common neighbor. Note that edges are never 
introduced between vertices that belong to the same sequence. D) 
Applying MaxFlow consistency followed by a clique transformation: 
edges within spanning trees are weighted by the relative number of 
common neighbors, and new edges are introduced to convert each 
spanning tree into a digue (or almost a digue, since edges are never 
introduced between vertices that belong to the same seguence). 
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